Why Do We Need Classification Metrics?

Consider the following situation:

You're at the doctor's office, and the doc rolls up and introduces herself. As she explains, she's adopted a unique approach to diagnosing patients: for every patient, her diagnosis is that they don't have cancer, regardless of whatever symptoms they may exhibit.

You're confused and somewhat concerned, but she assures you she has a track record of 99.9% accuracy.

After some quick googling, you find out cancer occurs in less than 0.1% of the total population.

Do you stay or get the heck outta there?

The doc isn't lying. She may have a 99.9% accuracy, but her diagnosis method is completely useless.

Out of the patients who did have cancer, she didn't predict a single one of them right. You should probably find a new physician.

Accuracy isn't a great measure when the classes are this imbalanced. We need something else — a set of metrics that describe how well the doctor picks up potentially positive cases, or correctly rules out potentially negative cases.

Here's what we'll learn today:

  1. Compute prediction probabilities, or the certainty behind a prediction
  2. Identify misclassifications — False Positives and False Negatives
  3. Build better metrics using misclassification rates
  4. Optimize models by tuning certain parameters
  5. Compare the ability of models to discriminate classes using ROC curves

Making Our Own Classifier

Imports

We'll start with the usual imports and set some options so we get consistent results when generating random data.
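A minimal sketch of what that import cell might look like (the exact libraries and the seed value of 42 are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Fix the seed so that randomly generated data is reproducible across runs
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
```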

Generate Data

Let's make some data! We'll start out with about 5,000 observations.

Here we're using the make_blobs() function to create two classes with similar attributes.
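One possible version of that call — the centers and spread here are assumptions, chosen so that x = 10 sits between the two blobs (matching the boundary used later):

```python
from sklearn.datasets import make_blobs
import pandas as pd

# Two 1-D blobs, one either side of x = 10
X, y = make_blobs(
    n_samples=5000,
    centers=[[8.0], [12.0]],  # assumed centers
    cluster_std=1.5,          # assumed spread
    random_state=42,
)
X = pd.DataFrame(X, columns=["feature"])

# make_blobs splits n_samples evenly across centers, giving a 50-50 class balance
class_counts = pd.Series(y).value_counts()
```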

How many observations are in each class?

This 50-50 split means our target class is balanced.

Exploration

What patterns do you see?

My strategy: draw a boundary at x = 10, since this best separates the classes (as seen in the histogram).

One thing to note: as we move farther from the boundary, our prediction becomes more certain.

In other words, the probability that a particular observation is of class 1 increases as you move further to the right

Define a Classifier

Let's use this boundary to make our own little classifier. Not from sklearn, just on our own!

You try! Create a predict() function. In other words:

  1. Take each value from the relevant feature (X_test is a DataFrame)
  2. Compare it against the boundary
  3. Make a decision of 1 or 0 based on that — make sure this is an int dtype
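One possible solution sketch. The class name, the column handling, and especially the sigmoid-based `predict_proba` are assumptions — any monotone squashing of distance-from-boundary would capture the "farther means more certain" idea:

```python
import numpy as np
import pandas as pd

BOUNDARY = 10.0  # from the histogram above

class BoundaryClassifier:
    """Hand-rolled classifier: predict 1 if the feature exceeds the boundary."""

    def __init__(self, boundary=BOUNDARY):
        self.boundary = boundary

    def predict(self, X):
        # X is a DataFrame; compare its first column to the boundary,
        # returning an int array of 0s and 1s
        return (X.iloc[:, 0] > self.boundary).astype(int).to_numpy()

    def predict_proba(self, X):
        # Crude "probability": squash signed distance from the boundary
        # through a sigmoid, so farther-right points get p(1) closer to 1
        d = X.iloc[:, 0].to_numpy() - self.boundary
        p1 = 1 / (1 + np.exp(-d))
        return np.column_stack([1 - p1, p1])
```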

As usual, we'll split up our full dataset into training and testing sets
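A sketch of that split (the 25% test size and the stratification are assumptions; the data generation repeats the assumed `make_blobs` setup from above):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import pandas as pd

X, y = make_blobs(n_samples=5000, centers=[[8.0], [12.0]],
                  cluster_std=1.5, random_state=42)
X = pd.DataFrame(X, columns=["feature"])

# Hold out 25% for testing; stratify keeps the 50-50 balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```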

Now, we use our little boundary classifier just as we would any sklearn model

Evaluate Performance

Let's take a look at some of the predictions by putting them side by side

This makes sense — for most of our observations, the predicted value matches the actual.

Note that the predicted probability corresponds with the predicted class. A more extreme probability (closer to 0 or 1) means that the data point was further from the boundary, so we're more certain which class it belongs to.

Question: How many test observations did we get right? In other words, what's the accuracy of our model?

Question: Of the actually positive observations, how many did we correctly predict as positive?

Here's a quick helper function to help label the confusion matrix
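One way to write such a helper (the function name and the row/column labels are assumptions) — it wraps `sklearn.metrics.confusion_matrix` in a labeled DataFrame:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def labeled_confusion_matrix(y_true, y_pred):
    """Confusion matrix with Actual rows and Predicted columns labeled."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        cm,
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"],
    )
```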

We can also grab the counts of each category

How would you calculate Sensitivity (aka True Positive Rate — TPR)?
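One direct way to compute it from the confusion matrix counts (sensitivity = TP / (TP + FN): of the actual positives, the share we caught):

```python
from sklearn.metrics import confusion_matrix

def sensitivity(y_true, y_pred):
    # confusion_matrix().ravel() unpacks in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn)
```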

What does the following figure describe?
I.e. What metric do we call the red / (gray + red) dots included here?

We can visualize which datapoints were misclassified in the following plot

Class Imbalance

Now let's model the situation from the doctor's office, with a heavy class imbalance.

Here, we'll use a 95-5 split, with only 5% of the target labels being of the positive class

Generate Imbalanced Data

To create this new dataset, we'll resample the old data, keeping only proportions that produce the desired balance.
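One way this resampling might look. The exact positive count (132, so positives make up about 5% of the result) and the seed are assumptions; we keep all the negatives and downsample the positives:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=5000, centers=[[8.0], [12.0]],
                  cluster_std=1.5, random_state=42)

# Keep all 2500 negatives; sample positives so they end up ~5% of the result:
# n_pos / (2500 + n_pos) = 0.05  ->  n_pos ≈ 132
rng = np.random.default_rng(42)
neg_idx = np.where(y == 0)[0]
pos_idx = rng.choice(np.where(y == 1)[0], size=132, replace=False)
keep = np.concatenate([neg_idx, pos_idx])

X_new = pd.DataFrame(X[keep], columns=["feature"])
y_new = y[keep]
```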

Note that the resulting dataset has half the original observations (we threw out quite a bit of the positives)

To explore the data, let's take a look at the histograms, both before and after

Create Model

To model our doctor's "evidence-blind" method, we can create what's known as a "dummy" classifier.

There are different dummy strategies, e.g. randomly predict 1 or 0 with a 50% chance each ('50-50'), always choose a class based on what we tell it ('hard-coded'), or always predict the highest-frequency category ('most-freq').

All of these ignore the feature data
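A sketch using sklearn's `DummyClassifier`, whose built-in names for the strategies above are `'uniform'`, `'constant'`, and `'most_frequent'` (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0] * 9 + [1])  # heavily imbalanced toy labels

# The doctor's "evidence-blind" method: always predict the hard-coded class 0.
# Other options: strategy="uniform" (50-50 coin flip),
#                strategy="most_frequent" (majority class)
dummy = DummyClassifier(strategy="constant", constant=0).fit(X_toy, y_toy)
preds = dummy.predict(X_toy)
```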

Create new train/test splits from our new data: We're just adding a _new suffix onto everything

Now let's use the dummy classifier and specify which value to predict (0, or negative) for each observation.

Use a _dummy suffix for the classifier and array outputs here

We're gonna try another dummy model — but one that uses a strategy of predicting 50-50

Use a _rando suffix for the classifier and array outputs here

Finally let's create a legitimate boundary classifier

Use a _bc suffix for the classifier and array outputs here

Evaluate Model Performance

Let's use our confusion matrix function from before to see the results

With dummy classifier

With the random classifier

With boundary classifier

Improving Our Model

Adjust Probability Threshold

Let's take a look at our original data classifier on the balanced dataset.

In our code before, we implicitly said, "if the predicted probability of being 1 is greater than 0.5, go ahead and classify it as 1."

A more explicit function call:
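One way to make that threshold explicit (the helper name `predict_with_threshold` is hypothetical; it works with any model that exposes `predict_proba`):

```python
import numpy as np

def predict_with_threshold(model, X, threshold=0.5):
    """Classify as 1 when P(class 1) clears the threshold (default 0.5)."""
    p1 = model.predict_proba(X)[:, 1]  # column 1 = probability of class 1
    return (p1 >= threshold).astype(int)
```

Lowering `threshold` moves borderline observations into the predicted-1 column; raising it does the opposite.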

What if we were to decrease the probability threshold?

I.e. if we lowered the evidence needed to call a particular observation a 1, would we now be predicting more or fewer 1s?

Clearly, we see more observations are being moved into the Predicted 1 column. How does that impact sensitivity? What about specificity?

You try! What would happen if we were to increase the probability threshold?

Optimization

If the cost of a false negative were greater than that of a false positive, i.e. missing a positive case was really bad, which alpha (of the ones above) would you choose? Why?

ROC Curve

Create helper functions
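One possible compute helper: sweep the probability threshold and record the (FPR, TPR) pair at each value — these pairs are exactly the points traced out by an ROC curve. The function name and signature are assumptions:

```python
import numpy as np

def roc_points(y_true, y_score, thresholds):
    """Compute (FPR, TPR) at each threshold by hand."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    fprs, tprs = [], []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        tprs.append(tp / (tp + fn))  # sensitivity
        fprs.append(fp / (fp + tn))  # 1 - specificity
    return np.array(fprs), np.array(tprs)
```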

Run on BoundaryClassifier

Run on Rando Classifier

As you can tell, the ROC is what allows us to really differentiate crappy models from good ones.

What does the 45-degree line tell us?

Run on Dummy Classifier

BoundaryClassifier on the 50-50 dataset

AUC ROC

We can summarize the curve in a single metric: the area under the curve, or AUC (of the ROC curve).

We'll use the sklearn.metrics.auc function for this
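A minimal example of that call on toy scores (the four labels and scores below are made up for illustration):

```python
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities of class 1

# roc_curve gives the (FPR, TPR) points; auc integrates the area under them
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
```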

You Try!

Try creating either a KNN or Decision Tree classifier on the balanced (original) dataset.

You can use the existing train test splits, but make sure to:

  1. Fit / Train the model
  2. Generate predictions and probabilities
  3. Show confusion matrix
  4. Show ROC
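The steps above can be sketched with a KNN model — one possible solution, with the data generation repeating the assumed `make_blobs` setup from earlier (k = 5 is an assumption too):

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc

X, y = make_blobs(n_samples=5000, centers=[[8.0], [12.0]],
                  cluster_std=1.5, random_state=42)
X = pd.DataFrame(X, columns=["feature"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 1. Fit / train the model
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# 2. Generate predictions and probabilities
y_pred = knn.predict(X_test)
y_proba = knn.predict_proba(X_test)[:, 1]
# 3. Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# 4. ROC inputs and AUC
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
```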